Abstract

Background: Accurate catalogs of structural variants (SVs) in mammalian genomes are necessary to elucidate the 
potential mechanisms that drive SV formation and to assess their functional impact. Next generation sequencing 
methods for SV detection are an advance on array-based methods, but are almost exclusively limited to four basic 
types: deletions, insertions, inversions and copy number gains. 

Results: By visual inspection of 100 Mbp of genome to which next generation sequence data from 17 inbred 
mouse strains had been aligned, we identify and interpret 21 paired-end mapping patterns, which we validate by 
PCR. These paired-end mapping patterns reveal a greater diversity and complexity in SVs than previously 
recognized. In addition, Sanger-based sequence analysis of 4,176 breakpoints at 261 SV sites reveals additional
complexity at approximately a quarter of structural variants analyzed. We find micro-deletions and micro-insertions 
at SV breakpoints, ranging from 1 to 107 bp, and SNPs that extend breakpoint micro-homology and may catalyze 
SV formation. 

Conclusions: An integrative approach using experimental analyses to train computational SV calling is essential for 
the accurate resolution of the architecture of SVs. We find considerable complexity in SV formation; about a quarter 
of SVs in the mouse are composed of a complex mixture of deletion, insertion, inversion and copy number gain. 
Computational methods can be adapted to identify most paired-end mapping patterns. 



Background 

The identification of structural variants (SVs) in mamma- 
lian genomes [1-4] has important implications for our 
understanding of genetic diversity, has elucidated the 
concept of genomic disorders [5,6] and has improved the 
analysis of genetic association in common and rare dis- 
eases [7-12], cancer development [13] and genomic evo- 
lution [14,15]. However, the accurate identification of 
SVs in mammalian genomes remains challenging. 

Next generation sequencing provides a novel approach 
for identifying structural variants [16] and exploits read- 
pair information [17,18], split reads [19,20], read depth 
[21] and sequence assembly [22] to localize SVs. Typically, 
variation in the expected number of reads mapping to the 
reference sequence is used to identify copy number var- 
iants while deviations from the expected distance between 
reads, and the orientation of reads, are used to infer the
presence and type of structural variant at a locus. These
methods presuppose that sequencing reads form charac- 
teristic patterns for different types of structural variants 
[23]. For example, when the two sequenced ends of a frag- 
ment map back to the reference genome in the correct 
orientation, but at a distance that is significantly larger 
than the size of the fragment itself (as inferred from the 
library insert size distribution), this indicates a deletion. 
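The deletion signature described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular SV caller: the `ReadPair` record, the `classify_pair` function and the three-standard-deviation cutoff are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical minimal record for one paired-end fragment."""
    left_pos: int        # leftmost reference coordinate of the upstream mate
    right_end: int       # rightmost reference coordinate of the downstream mate
    left_forward: bool   # True if the upstream mate maps to the forward strand
    right_forward: bool  # True if the downstream mate maps to the forward strand


def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Label a read pair by comparing its mapped span with the library insert size.

    The cutoff of n_sd standard deviations is an assumed threshold.
    """
    span = pair.right_end - pair.left_pos
    # Expected orientation: upstream mate forward, downstream mate reverse.
    if not (pair.left_forward and not pair.right_forward):
        return "orientation_anomaly"
    if span > mean_insert + n_sd * sd_insert:
        # Mates map further apart than the fragment size: deletion signature.
        return "deletion_signature"
    if span < mean_insert - n_sd * sd_insert:
        return "span_too_small"
    return "concordant"
```

For a library with a mean insert size of 400 bp and a standard deviation of 30 bp, a correctly oriented pair spanning 1,500 bp of reference would be labeled `deletion_signature`, whereas a pair spanning 400 bp would be `concordant`.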

Algorithms that use whole-genome sequence reads 
make assumptions about the paired-end mapping (PEM) 
patterns they will encounter, even though we know that 
the molecular architecture of structural variants can be 
remarkably complex [24,25]. For example, deletion and 
inversion events that appear simple may contain additional 
sequence at breakpoints and different types of structural 
variants sometimes occur together, so that, for example, 
an inversion immediately abuts a deletion [26]. However,
current automated methods to identify SVs are unable to 
differentiate basic patterns (for example, a simple 
inversion) from more complex ones (for example, an
inversion right next to a deletion), resulting in some SVs 
being incorrectly classified while others are missed 
altogether. 

Past studies have described several forms of complex 
structural variants, ranging from multiple rearrangements 
at large genomic loci [27,28], to deletions, inversions, 
insertions and duplications that fall in close proximity 
[29]. More recently, a subtle form of complex SVs has 
been characterized by micro-insertions or micro-deletions 
at the breakpoints of larger SVs [30]. In our present study, 
we describe complex SVs as two or more structural var- 
iants co-occurring at the same locus, without intervening 
DNA of normal structure between the variants (SVs are 
directly adjacent to each other) and without distinction by 
SV size (complex SVs can be two or more large SVs right 
next to each other or a small SV right next to a larger SV). 

Here we combine visual inspection of PEM data from 17 
mouse genomes [31] with experimental validation to 
resolve the molecular architecture of SVs and to guide 
genome-wide computational analysis [32]. We provide a 
comprehensive catalog of 21 PEM patterns derived from 
simple and complex SVs, and show how these patterns 
may provide insights into the fine-scale molecular archi- 
tecture of SV formation. 

Results 

Catalog of paired-end mapping patterns 

We started by generating a set of validated PEM patterns 
that we could use to guide genome-wide computational 
analysis. To do this, we visually examined short-read PEM
patterns and manually called SVs from 100 Mbp in 17
inbred strains of mice [31,32] (A/J, AKR/J, BALB/cJ, C3H/HeJ,
C57BL/6NJ, CBA/J, DBA/2J, LP/J, 129S5SvEvBrd,
129P2/OlaHsd, 129S1/SvImJ, NOD/ShiLtJ, NZO/HlLtJ,
CAST/EiJ, PWK/PhJ, WSB/EiJ and SPRET/EiJ) that
included the whole of mouse chromosome 19 (61 Mbp in
size), and a random set of other chromosomal regions. We
provide an overview of the procedure to catalog PEM patterns
in Figure 1a, b.

Based on read depth and anomalous PEM, we identi- 
fied 21 patterns, as described in Table 1 and Additional 
file 1. We unambiguously classified 11 PEM patterns, 
referred to as 'H' patterns, for high confidence. While 
some of the H patterns are typical and have already been 
described [23], others (H3, H5, H9 and H11) are novel.
Figure 2a shows the novel PEM pattern H5, an inversion 
directly flanked by two deletions. Note that the H5 pattern
of paired-end reads depends on the size of the
inversion: if the inversion is small, read pairs will span
both deletions and the inversion, giving the PEM pattern
of a typical deletion; if the inversion is larger, read pairs
will



individually span each deletion, giving a PEM pattern of 
an inversion. 

Interpretation of the remaining ten PEM patterns was 
ambiguous. We refer to these as type Q ('Questionable') 
patterns (Q1 to Q10; Table 1; Additional file 1). With the
exception of Q4 (large deletion), the remaining Q patterns
have not been described before. Two patterns
appeared to be false positives (Q6 and Q10). Q1 and Q7 were
due to variable number tandem repeats. Q5 and Q9 were 
difficult to interpret: read-pair information suggested a 
structural variant while read depth did not. Q2, Q3 and 
Q8 had partial patterns compared to typical SV patterns. 
For example, Q2 has a read depth of zero flanked by 
regions of normal read depth but it does not have paired- 
end reads spanning the length of the variant as does a 
typical deletion. 
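The contrast between a typical deletion and a partial pattern such as Q2 can be made concrete with a small sketch. It assumes a per-base read depth list and read pairs given as (end of left mate, start of right mate) in half-open coordinates; the function names and the `min_pairs` threshold are hypothetical and are not taken from the study.

```python
def zero_depth_runs(depth, min_len=1):
    """Return half-open (start, end) intervals where read depth is zero."""
    runs = []
    start = None
    for i, d in enumerate(depth):
        if d == 0 and start is None:
            start = i
        elif d != 0 and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the profile.
    if start is not None and len(depth) - start >= min_len:
        runs.append((start, len(depth)))
    return runs


def label_run(run, pairs, min_pairs=2):
    """Label a zero-depth run as a typical deletion or a partial pattern.

    A pair spans the run when its left mate ends at or before the run
    start and its right mate starts at or after the run end.
    """
    start, end = run
    spanning = sum(
        1 for left_end, right_start in pairs
        if left_end <= start and right_start >= end
    )
    return "typical_deletion" if spanning >= min_pairs else "partial_pattern"
```

A zero-depth run supported by spanning pairs is labeled as a typical deletion; the same run without spanning pairs is flagged as a partial pattern that needs experimental follow-up.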

We found that partial PEM patterns were caused by the 
presence of a de novo insertion right next to a deletion or 
inversion. Figure 2b shows an example of an inversion 
flanked by a de novo insertion. A paired-end read span- 
ning the first breakpoint of the inversion is mapped as 
expected to the reference genome. However, one end of 
the fragment spanning the second breakpoint of the 
inversion does not map to the reference genome (because 
it lies within the inserted sequence), creating a signature 
of an 'orphaned' read mate.

From the 100 Mbp we visually inspected, we identified 
a total of 1,494 SVs that matched the 21 PEM patterns 
(Additional file 1). Because visual identification of H6 
and H7 patterns was more difficult than for the other 
variants, we excluded them from our analysis of chromo- 
some 19 and identified 872 deletions (631 type H and 
241 type Q) bigger than 100 bp, 15 inversions (2 type H 
and 13 type Q) and 3 copy number gains (all type H) 
(Additional file 2). In addition to the 890 SVs identified 
on chromosome 19, we found 604 on the other chromo- 
somes. We refer to these lists on chromosome 19 and 
other chromosomal regions as our gold-standard list of 
PEM patterns. 

Since we examined the whole of chromosome 19 in 
eight strains, we looked at the distribution of SVs along 
the chromosome in the context of regional features. To 
do this, we counted the number of SVs overlapping pro- 
tein-coding genes, coding exons and repeat regions on 
chromosome 19 using Ensembl build 65, and we com- 
pared this to a null distribution of the expected number 
of overlaps, obtained by performing a permutation ana- 
lysis. Across all strains, we found a non-random distri- 
bution of SVs along the chromosome (Additional file 3) 
with enrichment (P < 0.01, fold change 2.2) in repeat
regions and depletion (P < 0.01, fold change 0.25) in
coding regions. We found only two SVs on chromosome 
19 that affect one, or more, coding exons of genes 
[Figure 1 panels: (a) Inspection of 100 Mb of NGS data in 17 genomes (whole chr19 + random set of other regions). (b) Identification of 21 PEM patterns across 1,494 SVs (detailed description of the 21 PEM patterns in Additional file 1); examples of two simple PEM patterns: H1, deletion of unique sequence (chr1:7904302-7904691 bp [11110101]) and H2, deletion of repeat sequence (chr8:31595128-31595783 bp [11110111]), each shown as paired-end reads mapping to the reference genome above a coverage track (0-20). (c) PCR validation of PEM patterns at 662 SV sites; PCR results for the two deletions H1 and H2. (d) Sanger-based sequencing at 261 simple SV sites; H1 deletion (chr10:20467348-20467894) of 547 bp with a 34 bp micro-insertion. Final box: Training of genome-wide computational analysis.]

Figure 1 An overview of the procedure to characterize structural variants. A flow diagram of the different steps we used to characterize
structural variants. (a) We first inspected 100 Mbp of next generation sequencing (NGS) data in 17 genomes. We examined chromosome 19 in
its entirety and a random set of other chromosomal regions. (b) We identified 21 PEM patterns across 1,494 SV sites. We show two examples of
PEM patterns, H1 and H2, as visualized using the LookSeq tool [38]. The H1 deletion is on chromosome 1 and has strain distribution pattern
11110101 (1 means presence and 0 absence of the deletion) in the following strain order: A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J
and LP/J. The H2 deletion is on chromosome 8 and is present in seven strains. (c) We randomly selected 662 SV sites for PCR-validation to
investigate all PEM patterns and show the results for the two deletion SVs. (d) We randomly selected 261 SV sites for analysis of breakpoint
sequence features using Sanger-based sequencing technology. We show sequencing data of a simple deletion of type H1 on chromosome 10
(20,467,348-20,467,894). Sequence analysis confirmed the deletion of 547 bp but also revealed an insertion of 34 bp.
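The strain distribution pattern used in the Figure 1 legend is a string of eight presence/absence flags in a fixed strain order. A minimal sketch of how such a pattern can be decoded follows; the helper names `decode_sdp` and `strains_with_sv` are hypothetical, while the strain order is the one given in the legend.

```python
# Strain order as stated in the Figure 1 legend.
STRAIN_ORDER = [
    "A/J", "AKR/J", "BALB/cJ", "C3H/HeJ",
    "C57BL/6J", "CBA/J", "DBA/2J", "LP/J",
]


def decode_sdp(sdp, strains=STRAIN_ORDER):
    """Map each strain to True (SV present) or False (SV absent)."""
    bits = sdp.replace(" ", "")  # tolerate spaces inside the pattern
    if len(bits) != len(strains) or set(bits) - {"0", "1"}:
        raise ValueError(f"expected {len(strains)} flags of 0/1, got {sdp!r}")
    return {strain: bit == "1" for strain, bit in zip(strains, bits)}


def strains_with_sv(sdp, strains=STRAIN_ORDER):
    """List the strains that carry the SV, in the fixed strain order."""
    return [s for s, present in decode_sdp(sdp, strains).items() if present]
```

Decoding the H1 pattern `11110101` gives six carrier strains (all except C57BL/6J and DBA/2J), and the H2 pattern `11110111` gives seven, in agreement with the legend.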
Table 1 The 21 PEM patterns with their corresponding SV type

PEM pattern   Brief description                                             SV type
H1            Deletion of unique sequence                                   Simple
H2            Deletion of repeat sequence (for example, LINE, SINE, ERV)    Simple
H3            Deletions separated by small normal copy (Del+Nml+Del)        Simple
H4            Typical inversion                                             Simple
H5            Inversion co-occurring with deletion(s)                       Complex
H6            Insertion of unique sequence (de novo sequence)               Simple
H7            Insertion of repeat sequence (for example, LINE, SINE, ERV)   Simple
H8            Tandem duplication                                            Simple
H9            Inverted tandem duplication                                   Complex
H10           Dispersed copy number gains                                   Simple
H11           Deletion or inversion within copy number gain                 Complex
Q1            Deletion due to microsatellite size polymorphism              Simple
Q2            Deletion of unique sequence co-occurring with insertion       Complex
Q3            Deletion of repeat sequence co-occurring with insertion       Complex
Q4            Large deletion                                                Simple
Q5            Linked small gain causing a false deletion                    Simple
Q6            False deletion due to retrotransposed pseudogene              False
Q7            Deletion due to VNTR                                          VNTR
Q8            Inversion co-occurring with insertion                         Complex
Q9            Inverted linked small gain causing a false inversion          Simple
Q10           False inversion due to inverted retrotransposed pseudogene    False

ERV, endogenous retrovirus; VNTR, variable number tandem repeat.



involved either in immunity or olfaction (Additional 
file 4). 
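The enrichment and depletion figures reported for chromosome 19 rest on comparing an observed overlap count with a null distribution from permutation. The source does not describe the permutation scheme, so the sketch below makes an explicit assumption: each SV is relocated uniformly at random along the chromosome while keeping its length. The function names and the way the empirical P value is computed are illustrative choices, not the authors' code.

```python
import random


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def count_overlapping(svs, features):
    """Number of SVs that overlap at least one feature interval."""
    return sum(
        any(overlaps(s, e, fs, fe) for fs, fe in features)
        for s, e in svs
    )


def permutation_test(svs, features, chrom_length, n_perm=1000, seed=0):
    """Compare the observed overlap count with randomly relocated SVs.

    Assumed null model: every SV keeps its length and is placed
    uniformly at random on the chromosome.
    """
    rng = random.Random(seed)
    observed = count_overlapping(svs, features)
    null_counts = []
    for _ in range(n_perm):
        shuffled = []
        for s, e in svs:
            length = e - s
            new_start = rng.randint(1, chrom_length - length)
            shuffled.append((new_start, new_start + length))
        null_counts.append(count_overlapping(shuffled, features))
    expected = sum(null_counts) / n_perm
    fold = observed / expected if expected > 0 else float("inf")
    # Empirical one-sided P values with the usual +1 correction.
    p_enrich = (1 + sum(c >= observed for c in null_counts)) / (n_perm + 1)
    p_deplete = (1 + sum(c <= observed for c in null_counts)) / (n_perm + 1)
    return observed, expected, fold, p_enrich, p_deplete
```

A fold change above 1 with a small `p_enrich` corresponds to enrichment (as reported for repeat regions), and a fold change below 1 with a small `p_deplete` corresponds to depletion (as reported for coding regions). A real analysis would also need to exclude assembly gaps when relocating SVs.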

Architecture of SVs using PEM inspection 

Next we tested the molecular architecture of SVs as 
inferred by our visual inspection of PEM patterns (Figure 1c).
To do this, we applied a primer design strategy
depending on type and length of the SV (Additional file 5)
and confirmed the underlying molecular structure of all
21 PEM patterns using PCR- and Sanger-based sequencing
across 8 (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J,
CBA/J, DBA/2J and LP/J) of the 17 Mouse Genomes Pro- 
ject strains [31]. These eight strains are the progenitors of 
the HS (heterogeneous stock) [33], an outbred population 
we have used to achieve genome-wide high-resolution 
mapping of multiple phenotypes [34]. 

We designed 742 pairs of primers (Additional file 6) and 
successfully amplified 662 SV sites (Additional file 7). It 
should be noted that we excluded failed designs (due to 
the presence of SNP(s) in the primer sequences) and 
designed 80 additional primer pairs to amplify sites when
one primer pair yielded no information or only part of 
the answer - for instance, for a deletion adjacent to an 
insertion predicted by visual inspection of the PEM. With 
the exception of insertions (excluded from our chromo- 
some 19 data set), SV sites we analyzed were representa- 
tive of the overall distribution of PEM categories 
(Additional file 8). 



We defined simple SVs as those whose structural inter- 
pretation is straightforward and consists of one SV type: 
insertions, deletions and inversions (Figure 3a). We also 
identify another type of insertion, a copy number gain, as 
consisting of non-repetitive DNA that is present in multi- 
ple copies relative to the reference genome. When this 
sequence occurs immediately adjacent to its original, it is 
annotated as tandem duplication; when it is small and 
close to another copy, it is annotated as a linked gain. 

In contrast to previous SV studies that use the number 
of breakpoints that fall in close proximity, our definition 
of complex SVs is based on the mixture of SV types (of 
small or large size) that directly abut each other, with no 
intervening DNA, since these might be the progeny of a 
single process (marked as Del+Ins, Del(s)+Inv and Inv 
+Ins in Figure 3b). We also separately identify an SV 
within a copy number gain (termed 'Del in gain' and 'Inv
in gain' in Figure 3b) since the probability of coincidence
is less than one event per genome. 
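The working definition of a complex SV given above (different SV types that directly abut each other, or an SV nested within a copy number gain) can be expressed as a short rule. The sketch below assumes each event is a tuple of half-open reference coordinates and a type label, with an insertion represented as a zero-length interval; this encoding and the function names are assumptions made for illustration.

```python
from itertools import combinations


def abut(a, b):
    """True if two reference intervals touch with no intervening DNA."""
    return a[1] == b[0] or b[1] == a[0]


def nested_in_gain(a, b):
    """True if one event lies inside the other and the outer one is a gain."""
    for outer, inner in ((a, b), (b, a)):
        if (
            outer[2] == "Gain"
            and inner[2] != "Gain"
            and outer[0] <= inner[0]
            and inner[1] <= outer[1]
        ):
            return True
    return False


def classify_locus(events):
    """Return 'complex' if any two events of different type abut or nest."""
    for a, b in combinations(events, 2):
        if a[2] != b[2] and (abut(a, b) or nested_in_gain(a, b)):
            return "complex"
    return "simple"
```

Under this rule, an inversion flanked by two deletions (Del(s)+Inv) is complex, whereas two deletions separated by a stretch of normal sequence (Del+Nml+Del) remain simple, matching the classification in Table 1.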

Our categorization of predicted SV structures, based on 
manual inspection of PEM patterns, resulted in the highly 
confident identification of a structural variant for 18 of the 
21 patterns that we examined by PCR: 12 were indicative
of a simple SV and 6 of a complex SV (Table 1). Two patterns
did not represent structural variants (Q6 and Q10),
but were due to the presence of a retrotransposed pseudo- 
gene, which caused false SV patterns. SVs of type Q7 (55 
cases) were due to variable number tandem repeats, for 
Figure 2 Novel PEM patterns. (a) PEM pattern of a Del(s)+Inv (H5). We draw paired-end reads (black arrows) and how they map to the
reference (Ref.) genome C57BL/6J (dashed grey lines). Blue boxes represent deletions and orange boxes inversions. Green arrows represent
primers used for PCR amplification and sequencing reactions. (b) PEM pattern of an Inv+Ins (Q8), with PCR data across eight classical strains (A/J,
AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J and LP/J). The pink box represents de novo sequence insertion. The amplicon size for BALB/cJ,
C3H/HeJ, CBA/J and DBA/2J is about 500 bp larger than the other strains, indicative of the insertion.



which we could not predict the number of repeats or 
molecular structure (Additional file 8). 

We estimated the relative proportions of simple and 
complex SVs by manual inspection of PEM patterns on 
chromosome 19. Assuming an equal number of deletions 



and insertions on chromosome 19, then about 88% of SVs 
are composed of one SV, 2.5% of two adjacent SVs at the 
same locus and 9.5% are variable number tandem repeats 
(Additional file 8). Note that we have not recorded SVs on 
chromosome 19 where three (or possibly more) different 
[Figure 3 panels: (a) Del, Ins, Inv, tandem Dup, copy number gain, linked gain, Del+Nml+Del; (b) Del+Ins, Del(s)+Inv, Inv+Ins, Inv in gain, Del in gain. Each panel shows the reference allele and the alternative allele as numbered segments. Key (pattern relative to the reference): Del, Ins, Inv, copy number gain.]

Figure 3 Architecture of structural variants. (a) Simple SVs: deletion (Del), insertion (Ins), inversion (Inv), tandem duplication (tandem Dup)
and other types of copy number gains. Linked gain is a small copy number gain at close proximity to its copy. Inverted linked gain (not drawn)
is similar to a linked gain but the copy is inverted. Del+Nml+Del is two deletions separated by a normal copy of small size. (b) Complex SVs:
deletion co-occurring with insertion (Del+Ins), inversion with flanking deletions (Del(s)+Inv), inversion with insertion (Inv+Ins), deletion within a
copy number gain (Del in gain) and inversion within a copy number gain (Inv in gain).



types of SVs co-occurred (for example, a deletion right 
next to an inversion and insertion), although about a 
dozen rearrangements had three different types of SVs 
based on their PEM patterns. Consequently, our estimate 
of the number of complex SVs on chromosome 19 based 
on PEM inspection is likely to be conservative. 

Fine-scale architecture of simple SVs 

To gain insights into the fine scale architecture of simple 
SVs as inferred by our visual inspection of PEM patterns, 
we randomly selected 261 simple SV sites and analyzed 
their breakpoints (Figure 1d; Additional file 8). Using the



rat as an outgroup species, we inferred SV ancestry (as 
described in [32]), and classified SVs into two groups 
(ancestral insertion or deletion). We found 105 ancestral 
insertions and 142 ancestral deletions; the remaining 14 
were either inversions (n = 6) or tandem duplications
(n = 8). We sequenced a total of 4,176 breakpoints at 261
selected SV regions in 8 strains using PCR-based Sanger 
sequencing (Additional file 9). 

We found additional complexities at breakpoints of 62 
SVs (24%): small additional (micro-insertions) and small 
missing (micro-deletions) sequences (Table 2). Figure 1d
shows an example of a simple H1 deletion; sequence
Table 2 Sanger-based sequence analysis at 4,176 breakpoints

Sequence features at breakpoint (row categories):
  Micro-deletion: None; 1-34 bp; >200 bp
  Micro-insertion: None; 1-10 bp; 11-50 bp; >51 bp
  Both micro-deletion and -insertion: 1-10 bp; 11-50 bp; >51 bp

SV class              Total analyzed   Percentages (in table order)
Ancestral insertion   105              14.3%, 1.0%
Ancestral deletion    142              73.2%, 19.7%, 5.6%, 1.4%
Inversion             6                ).7%, 16.7%, 16.7%
Gain                  8                87.5%, 12.5%

Total simple SVs analyzed = 261


analysis at the nucleotide level confirmed the deletion of 
547 bp but also revealed an insertion of 34 bp. 

We determined the extent and content of micro-inser- 
tions and micro-deletions and identified three patterns 
of SV breakpoints (classified in Table 2). The first pat- 
tern is characterized by micro-deletions at SV break- 
points. At 15% of ancestral insertions there were 
missing nucleotides at the breakpoints, ranging from 1 
to 289 bp. 

The second pattern includes SVs that have sequence 
inserted at their breakpoints; 27% of ancestral deletions 
showed a micro-insertion, with size ranging from 1 to 107 
bp. We report the origin of the sequence involved in 
micro-insertions in Additional file 9. There were three 
cases: (i) intra- or (ii) inter-chromosomal copy number 
gain of small size, or (iii) insertion of retrotransposons. 

The third pattern of SV breakpoints is characterized by
simple SVs in which both a micro-deletion and a
micro-insertion occur. One third of inversions had this pattern
at their breakpoint. SV size was not correlated with any
particular pattern, nor with micro-deletion/micro-insertion
length and type.

Our analysis of breakpoint sequence features in multiple 
strains also allowed us to look for a relationship between 
sequence variants and SV formation. In particular, we 
addressed the question as to whether sequence variants at 
breakpoints were associated with SV formation. 

In all cases, the presence of SNPs in the micro-homology 
region (short length of identical sequence at an ancestral 
deletion's start and end points) was correlated with the 
presence of the SV (Figure 4). The SNP elongates the 
micro-homology, or, alternatively, the micro-homology 
reflects a hyper-mutable state associated with break- 
induced replication around the SV [35]. However, this phe- 
nomenon is rare: we only observed five (4.5%) cases 
amongst our manually curated ancestral deletions 



(Additional file 9) where a SNP and SV formation co-seg- 
regate. We found a similar relationship between a SNP 
formed at the target site duplication and the presence of 
an ancestral insertion. Fifteen ancestral insertions (16%) 
had SNPs or short indels within their target site duplica- 
tion, coincident with the insertion (Additional file 9). 

Discussion 

Our findings are important in two ways. First, we show 
that an integrative approach using experimental analyses 
to train computational SV calling is essential for the accu- 
rate characterization of SV architecture. Second, we find a 
considerable complexity in SV formation; about a quarter 
of SVs in the mouse are composed of a complex mixture 
of deletion, insertion, inversion and copy number gain. 

In contrast to studies that start by identifying SVs using
automated genome-wide methods, followed by experimen- 
tal validation, we started by experimentally determining a 
set of SVs and then applied this information to interpret 
whole-genome automated SV detection [32]. Laboratory- 
based efforts proved essential for two main reasons. First, 
they allowed the correct interpretation of the PEM patterns. 
Without knowing how to interpret the underlying molecu- 
lar structure of each PEM, some patterns would be missed 
or classified incorrectly by computational methods alone. 
Second, our laboratory efforts allowed the recognition of a 
diversity of PEM patterns. Otherwise we would not be able 
to distinguish between simple and complex SVs. 

Finer-scale breakpoint sequence analysis reveals that 
24% of simple SVs have smaller rearrangements at the 
nucleotide level (micro-insertions or micro-deletions at 
the breakpoint of a larger SV). This raises questions 
about the likely mechanisms of SV formation. 

We know that retrotransposition is the commonest 
mechanism of SV formation in the mouse [32]. We also 
know that retrotransposons (LINEs, SINEs and long 
Figure 4 Relationship between SNP and SV formation. Two SNPs lying on the 6 bp micro-homology of an ancestral deletion of 64 bp
(chr12:27,040,459-27,040,522) correlated with the presence of the SV. Sequencing traces are shown for a test strain (A/J) and the reference strain
(C57BL/6J). Note that all other test strain traces are identical to the one shown here. The black asterisks indicate the micro-homology of 6 bp 
(GAACTA). The presence of two SNPs (C->G and T->A, shown using red asterisks) in all test strains (here only shown in A/J) is associated with 
the presence of the structural variant. 



terminal repeats) are typically characterized by flanking 
target site duplications and a poly(A) tail or poly(T) head. 
However, we observed that 15% of retrotransposon SVs
do not have target site duplications and have truncated or
absent poly(A) tails or poly(T) heads (Additional file 9).
Moran and colleagues [36] observed a similar phenom- 
enon in the human genome and suggested that retrotran- 
sposons, such as LINE-1 elements, integrate into DNA 
lesions, resulting in retrotransposon-mediated DNA 
repair. We suggest that about 15% of retrotransposon 
SVs in the mouse genome formed through a similar 
mechanism involving DNA repair. 

It is reasonable to assume that the complexities (micro- 
insertions and micro-deletions) we see at the breakpoints 
of ancestral deletions, inversions and gains (we call these 
'complex non-retrotransposon SVs') (Table 2) will correlate
with a complex mechanism of formation. A DNA
replication fork stalling and template switching/microho- 
mology-mediated break-induced replication (FoSTeS/ 
MMBIR) mechanism has been proposed to generate such 
complex SVs in the human genome [37]. In addition, 
about half of our complex non-retrotransposon SVs have 
microhomology (short sequence of identical bases) ranging 
from 3 to 25 bp (Additional file 9), compatible with a 
microhomology-mediated break-induced replication pro- 
cess. It could be that the complex non-retrotransposon 
SVs are also the progeny of mutational processes during 
DNA replication. 

Interestingly, our estimate that 24% of SVs have micro- 
structures at their breakpoint is the same as that reported 
by Eichler and colleagues [30] in a study of human 



structural variation. Another sequencing-based study of 
SVs in two mouse strains (DBA/2J and C57BL/6J) exam- 
ined 3,316 breakpoints and reported that 16% of non- 
transposon structural variants are complex, as defined by 
multiple breakpoints mapped to within 1 kbp of each 
other [29]. However, we were not able to directly compare
these results to ours since we have not used the same classification
criteria (we used a classification based on SVs
being right next to each other, whereas that of Hall and colleagues
[29] was based on SVs being in close proximity).

Ideally, longer sequencing reads would be
required to resolve the complex architecture of the structural
variants we report in this study, something that
goes beyond current-generation sequencing platforms.
Our findings offer an intermediate solution
between next generation sequencing analysis and com- 
plete de novo assembly of genomes. 

Materials and methods 

Visual identification of PEM patterns 

We visually inspected short-read sequencing data using 
LookSeq [38] and manually detected PEM patterns across 
mouse chromosome 19 in its entirety and a random set 
of other chromosomal regions, accounting for 100 Mbp 
of total genomic regions. We analyzed the molecular architecture
of these PEM patterns at nucleotide-level resolution
using PCR and Sanger-based sequencing.

Experimental validation of SV architecture 

Primers were designed using Primer3 [39] and purchased 
from MWG (Ebersberg, Germany). Primer design strategy 
was dependent on the type and size of the structural variant
(Additional file 5). Three independent PCR reactions
were carried out with Hotstar Taq obtained from Qiagen
(Hilden, Germany). Reactions were performed as previously
described [40]. LongRange PCR kit (Qiagen) was
used for genomic regions > 2 kbp. PCR gel images were
then assessed for quality of primer design and performance
of PCR reaction. Representative PCR examples are
given in Additional file 1. We provide interested readers
with PCR results for each SV site (Additional file 7). Complex
SV sites were subject to several rounds of primer
design and PCR. PCR products were then purified in a 96-well
Millipore purification plate, resuspended in 30 µl of
H2O and sequenced as previously described [40]. All
sequencing reactions were run out on an ABI3700 sequencer
and assembled by using PHRED/PHRAP [41].

Assembly of Sanger-based sequencing data 

Consed was used for visualization and editing of the
assembly [42]. Strains with and without the SV were 
aligned into one contig. Breakpoint analysis was mostly 
based on visual inspection of the alignment and BLAT 
search. The first breakpoint was identified when the 
strains with and without the SV stopped aligning and 
the second breakpoint when they started aligning again. 
Micro-insertion at the breakpoint was defined as any
additional sequence relative to the reference genome
(MGSCv37) in the strains with a deletion variant.
Micro-deletion at the breakpoint was defined as any
missing sequence relative to the reference genome
(MGSCv37) in the strains with an insertion variant. We
also recorded the longest stretch of sequence identity at
both breakpoints, which we defined as sequence micro-homology.
When micro-homology occurred, we
recorded the SV so that start and end coordinates were
the smallest. For blunt-ended SVs, one set of start and
end coordinates was recorded. BLAT was used to get
the exact start and end coordinates of the SV.
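The breakpoint definitions above can be illustrated with a simplified sketch. In the study, breakpoints were determined by visual inspection of the alignment and BLAT searches; the code below is not that procedure. It assumes two error-free sequences covering the same amplicon, one without the SV (`ref`) and one with it (`alt`), and all function names are hypothetical. The first breakpoint is where the sequences stop matching from the left, the second is where they match again from the right, and the breakpoints are shifted to the smallest coordinates when micro-homology is present.

```python
def common_prefix_len(a, b):
    """Length of the longest shared prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def analyze_breakpoints(ref, alt):
    """Locate breakpoints, micro-homology and extra or missing sequence.

    Returned coordinates are 0-based offsets into ref (half-open).
    """
    p_full = common_prefix_len(ref, alt)
    s_full = common_prefix_len(ref[::-1], alt[::-1])
    shorter = min(len(ref), len(alt))
    # Bases matched by both the prefix and the suffix are micro-homology.
    microhomology = max(0, p_full + s_full - shorter)
    # Keep the full suffix and trim the prefix, so that the event is
    # reported at the smallest possible coordinates.
    suffix = min(s_full, shorter)
    prefix = min(p_full, shorter - suffix)
    return {
        "first_breakpoint": prefix,
        "second_breakpoint": len(ref) - suffix,
        "missing_in_alt": ref[prefix:len(ref) - suffix],
        "extra_in_alt": alt[prefix:len(alt) - suffix],
        "microhomology_len": microhomology,
    }
```

For a strain carrying a deletion, `missing_in_alt` is the deleted reference sequence and a non-empty `extra_in_alt` corresponds to a micro-insertion at the breakpoint. Real traces would need quality trimming and a proper aligner to tolerate sequencing errors and SNPs in the flanks.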

Genome-wide detection of structural variants 

Genome-wide structural variants were detected using 
four methods: split-read mapping (Pindel) [20], mate-pair
analysis (BreakDancer) [18], single-end cluster analysis
(SECluster and RetroSeq, unpublished), and read-depth
(CND) [21]. Details of the complete pipeline,
SVMerge, are described elsewhere [43]. We used in-house
Perl scripts to detect genome-wide complex structural 
variants [32]. 
Data 

Data sets described in this study are available under 
study accession number 'estd185' from the Database of
Genomic Variants Archive (DGVa) [44]. Our previous 
genome-wide data of structural variants [32] are also 
available from DGVa under accession number 'estd118'.



Additional material 



Additional file 1: 21 PEM patterns. We found 11 'high-confidence' 
patterns and 10 'questionable' patterns. For each PEM, we provide PEM 
details, illustration using LookSeq [38] and PCR results. We show paired- 
end reads (black arrows) and how they map to the reference genome 
(dashed grey lines). Green arrows represent primer pairs. PCR was carried 
out across the founder strains of the HS [33]. We used HyperLadder II as
size marker. 

Additional file 2: Chromosome 19 gold-standard data set. Columns 1 
to 3: chromosome, approximate SV start and end coordinates (bp).
Column 4: SV length (bp). Column 5: PEM pattern (Table 1; Additional file
1). Columns 6 to 13: strain distribution pattern (SDP) across eight classical
strains (1 = SV present; 0 = SV absent). Column 14: has SV been PCRed 
(1 = yes). 

Additional file 3: Distribution of manual SV calls along chromosome 19.
The top horizontal tracks show the chromosomal distribution of
manually identified structural variants (deletions, inversions and 
duplications) for specific mouse strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, 
CBA/J, DBA/2J and LP/J). The bottom two tracks represent genes 
(Ensembl 65) and gaps on chromosome 19.

Additional file 4: Chromosome 19 manual SV calls that affect 
coding regions. Column 1: chromosome. Columns 2 and 3: SV start and 
stop coordinates (bp). Column 4: SV event. Column 5: affected gene (a 
plus sign indicates that the gene is affected in its entirety). Column 6: 
description of the gene. 

Additional file 5: Primer design strategy. We applied a primer design 
strategy depending on type and length of the SV. Forward primer is in 
green and reverse primer in red. SV sites were repeat masked prior to 
primer design, using RepeatMasker [45]. Breakpoints were initially 
predicted using LookSeq [38]. Primer design is illustrated for: (a) tandem 
duplication, (b) insertion, (c) deletion and (d) inversion. 

Additional file 6: Primers. For each primer pair (PP), we provide a 
primer pair identification, name and sequence of forward and reverse 
primers. 

Additional file 7: PCR data in eight classical strains. Column 1: 
chromosome. Columns 2 and 3: SV start and end coordinates (bp). 
Column 4: SV length. Column 5: PEM pattern (Table 1; Additional file 1).
Columns 6 to 13: 1 = presence or 0 = absence of the SV (2, 3 and 4 
indicate multi-allelic SVs). Column 14: has the site been (= 1) or not (= 0) 
resolved at nucleotide level resolution (when column 14=1, columns 2 
and 3 refer to the exact coordinates, otherwise they are estimates). 
Column 15: primer coverage (number of primer pairs designed per 
unique SV site). Column 16: primer pair used to amplify the SV region. 

Additional file 8: Summary data of PCR and Sanger-based 
sequencing for each of the 21 PEM patterns. Column 1: PEM pattern 
(Table 1; Additional file 1). Column 2: number of unique SV sites PCRed. 
Column 3: chromosome 19 data (some cells are marked NA (not 
applicable) because we have not systematically inspected H6 and H7 
patterns). Column 4: predicted SV. Column 5: PCR validated SV. Column 
6: number of SVs sequenced at nucleotide level. Column 7: type of SV as 
simple, complex, false or variable number tandem repeat. 

Additional file 9: The 261 simple SV sites resolved at nucleotide 
level resolution using Sanger-based sequencing. Column 1: primer 
name. Columns 2 to 5: exact SV position. Column 6: PEM pattern. 
Column 7: length of any micro-deletion or micro-insertion at the SV 
breakpoint (bp). Columns 8 to 15: strain distribution pattern SDP. 
Columns 16 and 17: is there a SNP within the micro-homology (MH) or 
target site duplication (TSD) (0 = no, 1 = yes; NA, not applicable). 
Column 18: MH length (bp). Column 19: MH type. Column 20: TSD 
length. Column 21: origin of the inserted sequence. 
bp: base pair; Del: deletion; Dup: duplication; kbp: kilobase pair; Ins: insertion;
Inv: inversion; LINE: long interspersed nuclear element; Mbp: megabase pair; 
Nml: normal; PCR: polymerase chain reaction; SINE: short interspersed 